Abstract. We present the first deterministic sub-linear space algorithms
for a number of fundamental problems over update data streams, such
as (a) point queries, (b) range-sum queries, (c) finding approximate
frequent items, (d) finding approximate quantiles, (e) finding approximate
hierarchical heavy hitters, (f) estimating inner-products, (g) constructing
near-optimal B-bucket histograms, (h) estimating entropy of data
streams, etc. We also present new lower bound results for several
problems over update data streams.

1 Introduction 

The data streaming model [2, 29] presents a viable computational model for
monitoring applications, for example, network monitoring, sensor networks, etc., 
where data arrives rapidly and continuously and has to be processed in an online 
fashion using sub-linear space. Some examples of fundamental data streaming 
primitives include (a) estimating the frequency of items (point queries) and
ranges (range-sum queries), (b) finding approximate frequent items, (c) finding
approximate quantiles, (d) finding approximate hierarchical heavy hitters,
(e) estimating inner-product, (f) constructing approximately optimal B-bucket
histograms, (g) estimating entropy, etc.

A data stream is viewed as a sequence of arrivals of the form $(i, v)$, where $i$
is the identity of an item belonging to the domain $\mathcal{D} = \{0, 1, \ldots, N-1\}$
and $v$ is a non-zero integer that depicts the change in the frequency of $i$.
$v \ge 1$ signifies $v$ insertions of the item $i$ and $v \le -1$ signifies $|v|$
deletions of $i$. The frequency of an item $i$ is denoted by $f_i$ and is defined
as the sum of the changes to its frequency since the inception of the stream,
that is, $f_i = \sum_{(i,v) \text{ appears in stream}} v$.
If $f_i \ge 0$ for all $i$ (i.e., deletions correspond to prior insertions) then the
corresponding streaming model is referred to as the strict update streaming model
(i.e., Turnstile model [29]). The model where $f_i < 0$ is allowed is called the
general update streaming model (i.e., general Turnstile model [29]). The insert-only
model refers to data streams with no deletions, that is, $v > 0$. For strict update
streams or for insert-only streams, $m$ denotes the sum of frequencies, that is,
$m = \sum_{i \in \mathcal{D}} f_i$. For general update streams, $L_1$ denotes the
standard norm $L_1 = \sum_{i \in \mathcal{D}} |f_i|$.
* Work done while at IIT Kanpur. 



Prior work on deterministic algorithms over update streams. Despite the sub- 
stantive advances in algorithms for data stream processing, there are no deter- 
ministic sub-linear space algorithms for a family of fundamental problems in 
the update streaming models including, estimating the frequency of items and 
ranges, finding approximate frequent items, finding approximate $\phi$-quantiles,
finding approximate hierarchical heavy hitters, constructing approximately
optimal B-bucket histograms, estimating inner-products, estimating entropy, etc.
Deterministic algorithms are often indispensable in practice. For example, in a
marketing scenario where frequent items correspond to subsidized customers, a
false negative would correspond to a missed frequent customer, and conversely,
in a scenario where frequent items correspond to punishable misuse [24], a false
positive results in an innocent victim.

Gasieniec and Muthukrishnan [29] (page 31) briefly outline a data structure
that we use later and now review. We refer to this structure as the
CR-precis structure in this paper (because the Chinese Remainder theorem
plays a crucial role in our analysis). The structure is parameterized by a height
parameter $k$ and a width parameter $t$. Choose $t$ consecutive prime numbers
$k < q_1 < q_2 < \ldots < q_t$ and keep a collection of $t$ tables $T_j$, for
$j = 1, \ldots, t$, where $T_j$ has $q_j$ integer counters, numbered
$0, 1, \ldots, q_j - 1$. Each stream update of the form $(i, v)$ is processed as follows.

for j := 1 to t do { T_j[i mod q_j] := T_j[i mod q_j] + v }
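
The update rule above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the class name `CRPrecis`, the helper `primes_above`, and the trial-division primality test are all assumptions introduced here for readability.

```python
from typing import List


def primes_above(k: int, t: int) -> List[int]:
    """Return the t smallest primes strictly greater than k (hypothetical helper)."""
    found: List[int] = []
    candidate = max(k + 1, 2)
    while len(found) < t:
        # Trial division is enough for the small primes used in a sketch.
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


class CRPrecis:
    """t tables; table j holds q_j counters, for consecutive primes q_j > k."""

    def __init__(self, k: int, t: int) -> None:
        self.primes = primes_above(k, t)
        self.tables = [[0] * q for q in self.primes]

    def update(self, i: int, v: int) -> None:
        """Process the stream update (i, v): add v to counter i mod q_j in every table."""
        for q, table in zip(self.primes, self.tables):
            table[i % q] += v


sketch = CRPrecis(k=4, t=3)  # uses the primes 5, 7, 11
for item, delta in [(12, 3), (5, 2), (12, -1)]:
    sketch.update(item, delta)
```

Each update touches exactly one counter per table, which matches the $O(t)$ arithmetic operations per update stated in Lemma 1.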

Lemma 1 presents the space requirement of the CR-precis structure and is implicit
in [29] (pp. 31). Its proof is given in Appendix A.

Lemma 1. The space requirement of a CR-precis structure with height parameter
$k \ge 12$ and width parameter $t \ge 1$ is
$O(t(t + \frac{k}{\ln k}) \log(t + \frac{k}{\ln k})(\log L_1))$ bits.
The time required to process a stream update is $O(t)$ arithmetic operations. □

Gasieniec and Muthukrishnan use this structure to solve the k-set problem,
namely, given that there are at most $k$ items with non-zero frequency, identify
the items and their frequencies. Let $t = (k - 1)\log_k N + 1$. With this choice
of $t$, the authors argue that each of the top-k items is isolated in some counter
(or group) of a table in the data structure, as follows. If $f_x$ and $f_y$ are each
non-zero, then $x$ and $y$ can collide in at most $\log_k N$ counters. Otherwise, the
difference $|x - y| < N$ will be divisible by $\log_k N + 1$ different primes, each larger
than $k$. The product of these primes is greater than $k^{\log_k N + 1} = kN > N$, a
contradiction. The authors state that "doing log N non-adaptive sub-grouping
with each other groups above will solve the problem of identifying the toppers" 1
and then claim that the total space required is poly(k, log N). We note that the
k-set problem can be solved using space $O(k \log^2(mN))$ bits for strict update
streams and using space $O(k^2 \log^2(mN))$ bits for general update streams [TH|
using a different technique. The authors do not consider any of the variety of

1 Gasieniec and Muthukrishnan state the problem as that of finding the top-k items,
called k-toppers, in a stream with at most k non-zero frequencies. Hence, each item
with non-zero frequency qualifies as a topper.



the problems that we consider in this paper, including, estimating the frequency 
of items and ranges, finding approximate frequent items, finding approximate 
quantiles, constructing approximately optimal B-bucket histograms, estimating 
inner-product sizes, estimating entropy, etc. 

Contributions. We present the first deterministic and sub-linear space algorithms 
for a set of fundamental problems for update streams, including, estimating 
the frequency of items and ranges, finding approximate frequent items, finding 
hierarchical heavy hitters, estimating inner-products of a pair of streams,
estimating approximate quantiles, constructing approximately optimal B-bucket
histograms, estimating entropy, etc. Gasieniec and Muthukrishnan [29] do not
consider any of the above-mentioned problems. We use the data structure of
Gasieniec and Muthukrishnan; however, our novelty lies in the effective analysis
of the structure using the Chinese Remainder theorem (and hence we name this
structure the CR-precis structure).

We also present new lower bound results. We show that any algorithm that
returns an estimate $\hat{f}_i$ of the frequency $f_i$ of an item $i$ in a strict update
stream satisfying $|\hat{f}_i - f_i| \le \frac{m}{s}$ with probability at least 2/3 requires
$\Omega(s(\log m)(\log \frac{N}{s}))$ bits. We also show that over general update
streams, the problems of finding approximate frequent items, finding approximate
quantiles, estimating the entropy and estimating $k$th norms $L_k$ require
$\Omega(N)$ space.

Organization. The remainder of the paper is organized as follows. In Section 2
we define data streaming problems of interest. A more detailed review is given
in Appendix B. Section 3 presents the technical results in the paper. Finally, we
conclude in Section 4.

2 Review 

In this section, we briefly review some basic problems over data streams. 

The point query problem with parameter $s$ is the following: given $i \in \mathcal{D}$,
obtain an estimate $\hat{f}_i$ such that $|\hat{f}_i - f_i| \le \frac{m}{s}$. For insert-only
streams, the Misra-Gries algorithm [28], rediscovered and refined in [13, 4, 25], uses
$s \log m$ bits and returns $\hat{f}_i$ such that $f_i \le \hat{f}_i \le f_i + \frac{m}{s}$.
The Lossy Counting algorithm [26] is also a deterministic point query estimator for
insert-only streams that returns an estimate satisfying
$f_i \le \hat{f}_i \le f_i + \frac{m}{s}$ using $s \log \frac{m}{s} \log m$ bits. The Sticky
Sampling algorithm [26] extends the Counting Samples algorithm [17] to return an
estimate satisfying $f_i - \frac{m}{s} \le \hat{f}_i \le f_i$ with probability $1 - \delta$
using space $O(s \log \frac{1}{\delta} \log m)$ bits.
For strict update streams, the Count-Min sketch algorithm satisfies
$f_i \le \hat{f}_i \le f_i + \frac{m}{s}$ with probability $1 - \delta$ using space
$O(s \log \frac{1}{\delta} \log m)$ bits. For general update streams, the Count-Min
sketch algorithm satisfies $|\hat{f}_i - f_i| \le \frac{L_1}{s}$ using the same
order of space. The Countsketch algorithm [7] is applicable for general update
streams and satisfies $|\hat{f}_i - f_i| \le (F_2^{res}(s)/s)^{1/2} \le \frac{L_1}{s}$ with
probability $1 - \delta$ using space $O(s \log \frac{1}{\delta} \log m)$, where
$F_2^{res}(s)$ is the sum of the squares of all but the top-s frequencies in the
stream. [4] show that any algorithm that returns $\hat{f}_i$ satisfying
$|\hat{f}_i - f_i| \le \frac{m}{s}$ must use $\Omega(s \log \frac{N}{s})$ bits.

An item i is said to be frequent with respect to parameter s provided |/»| > 
^- L . Since, finding all and only frequent items requires fl{N) space |12I25| . re- 
search has focused on the following problem of finding e-approximate frequent 
items, where, < e < 1 is a parameter: return all frequent items but do not 
return any i such that |/,| < £=±>Ll. |7ll2lllllfll7l2^l28l2W| . As reviewed 
in Appendix [5] algorithms for finding frequent items typically use point query 
estimators and return all items whose estimated frequency exceeds the threshold 
for frequent items. ^H] uses Count-Min sketches for finding frequent items over 
strict update streams with probability 1 — 6 using 0(| log s lo s W s ) l g HL logm) 
bits. The hierarchical heavy hitters problem [9, 10, 14, 23] generalizes the
frequent items problem to hierarchical domains (see Appendix B).

Given a range $[l, r]$ from the domain $\mathcal{D}$, the range frequency is defined as
$f_{[l,r]} = \sum_{x=l}^{r} f_x$. The range-sum query problem with parameter $s$ is: given a
range $[l, r]$, return an estimate $\hat{f}_{[l,r]}$ such that $|\hat{f}_{[l,r]} - f_{[l,r]}| \le \frac{m}{s}$. A standard
approach is to decompose a given interval as a canonical disjoint sum of at most 
2 log TV dyadic intervals [23j (See Appendix© . [TO] uses Count-Min sketches to 
estimate range-sums using space 0(s log l ° s N log N log m) bits and with prob- 
ability 1 — S. Given < <f> < 1 and j — 1,2, ... , [</> _1 ], an e-approximate j 
(j)-quantile is an item aj such that (jcf> — e)m < X^" 1 fi < (j(j> + e)rn. The prob- 
lem has been studied in jl()l2H19l2*7] . For insert-only streams, [21] presents an 
algorithm requiring space 0((log e _1 ) log(em)) for insert-only streams. For strict 
update streams, the problem of finding approximate quantiles can be reduced 
to that of estimating range sums (See Appendix |B|) . [TU] uses Count-Min 
sketches to find e-approximate 0-quantiles with confidence 1 — 6 using space 
0(ilog 2 A(log^)). 

A B-bucket histogram $h$ is an $N$-dimensional vector with $B$ interval-value
pairs as follows. Divide the domain $\mathcal{D} = \{0, 1, \ldots, N-1\}$ into $B$
non-overlapping intervals, say, $I_1, I_2, \ldots, I_B$. For each interval $I_j$, choose a
value $v_j$. Then $h$ is the vector such that for each $i \in \mathcal{D}$, $h_i = v_j$,
where $I_j$ is the unique interval containing $i$. The cost of a B-bucket histogram $h$
with respect to the frequency vector $f$ is defined as
$\|f - h\| = \sum_{j=1}^{B} \sum_{i \in I_j} (f_i - v_j)^2$. Let $h^{opt}$ denote an optimal
B-bucket histogram satisfying $\|f - h^{opt}\| = \min_{B\text{-bucket histogram } h} \|f - h\|$. The
problem is to find a B-bucket histogram $h$ such that $\|f - h\| \le (1 + \epsilon)\|f - h^{opt}\|$.
An algorithm for this problem is presented in a seminal paper [18] using space
and time poly$(B, \frac{1}{\epsilon}, \log m, \log N)$ and improved in [22].
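
As a small illustration of the cost definition above, the following sketch evaluates the cost of a given B-bucket histogram. The function names `histogram_cost` and `bucket_means` are hypothetical, and the use of the bucket mean as the bucket value relies on the standard fact that the mean minimizes a sum of squared deviations; it is not a procedure taken from the paper.

```python
from typing import List, Sequence, Tuple


def histogram_cost(f: Sequence[float],
                   buckets: List[Tuple[int, int]],
                   values: Sequence[float]) -> float:
    """Sum, over buckets [lo, hi] (inclusive), of (f_i - v_j)^2 for i in the bucket."""
    return sum((f[i] - v) ** 2
               for (lo, hi), v in zip(buckets, values)
               for i in range(lo, hi + 1))


def bucket_means(f: Sequence[float], buckets: List[Tuple[int, int]]) -> List[float]:
    """For fixed bucket boundaries, the mean of each bucket minimizes its squared error."""
    return [sum(f[lo:hi + 1]) / (hi - lo + 1) for lo, hi in buckets]


f = [1, 1, 5, 5]
two_buckets = [(0, 1), (2, 3)]
one_bucket = [(0, 3)]
cost_two = histogram_cost(f, two_buckets, bucket_means(f, two_buckets))
cost_one = histogram_cost(f, one_bucket, bucket_means(f, one_bucket))
```

On this toy frequency vector, two buckets represent the data exactly, whereas a single bucket with value 3 incurs a squared error of 4 at each of the four positions.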

Given two streams R and S with item frequency vectors / and g respectively, 
the inner product f ■ g is defined as X^g-p /* ' 9i- ^he problem is to return an 
estimate P satisfying \P — / • g\ < A. The work in presents a space lower 
bound of s — Q(J^). Randomized algorithms |1I8I15| match the space lower 
bound, up to poly-logarithmic factors (See Appendix iBl) 

The entropy of a data stream is defined as
$H = \sum_{i \in \mathcal{D}} \frac{|f_i|}{L_1} \log \frac{L_1}{|f_i|}$. It is a
measure of the randomness, or the incompressibility, of the stream. The problem
is to return an $\epsilon$-approximate estimate $\hat{H}$ satisfying $|\hat{H} - H| \le \epsilon H$. For
insert-only streams, jS] presents an e-approximate entropy estimator that uses 
space 0(j2 log | log 3 to) bits and also shows an ^( e 2 log 1 ( 1 / e ) ) space lower bound 
for estimating entropy. For update streams, |3] presents an e-approximate es- 
timator that requires space <3((e -3 log 5 TO)(log e _1 )(log S" 1 )). For a > 1, an 
a- approximation for H is an estimate H such that HaT 1 < H < Hb such that 
ab < a. a-approximate estimators of H are presented in [231 using 0(N« log N) 
bits and in using 0(min(TO 2 / 3 , to "+ 1 )) bits [5]. 

We note that sub-linear space deterministic algorithms are not known for 
any of the above-mentioned problems. 

3 CR-precis structure for update streams 

In this section, we use the CR-precis structure to present algorithms for a family 
of basic problems over update streams. 

An application of the Chinese Remainder Theorem. Consider a CR-precis structure
with height $k$ and width $t$. Fix $x \in \{0, \ldots, N-1\}$. Suppose $J \subseteq \{1, 2, \ldots, t\}$
such that $|J| \ge \log_k N$. How many items $y$ from the domain $\{0, 1, \ldots, N-1\}$
map to the same bucket as $x$ in each of the tables $T_j$, for $j \in J$? By the Chinese
Remainder theorem, there is a unique solution in the range $0 \le y \le \prod_{j \in J} q_j - 1$
to the equations $x \equiv y \bmod q_j$, for each $j \in J$. Since $\prod_{j \in J} q_j > k^{\log_k N} = N$,
it follows that the only solution for $y \in \{0, \ldots, N-1\}$ is $x$. Therefore, for any
given $x, y \in \{0, 1, \ldots, N-1\}$ such that $x \ne y$,

$$|\{j \mid y \equiv x \bmod q_j \text{ and } 1 \le j \le t\}| \le \log_k N - 1. \qquad (1)$$

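
The collision argument can be checked exhaustively on a small domain. The sketch below is illustrative only: the parameters `N = 200`, `k = 3`, `t = 8` and the helper names are assumptions chosen so the loop finishes quickly. It verifies that the product of the primes on which two distinct items collide divides their difference, so the number of shared primes stays below $\log_k N$.

```python
from math import prod
from typing import List


def primes_above(k: int, t: int) -> List[int]:
    """Return the t smallest primes strictly greater than k (hypothetical helper)."""
    found: List[int] = []
    candidate = max(k + 1, 2)
    while len(found) < t:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def shared_primes(x: int, y: int, primes: List[int]) -> List[int]:
    """Primes q for which x and y fall into the same counter, i.e. x = y mod q."""
    return [q for q in primes if x % q == y % q]


N, k, t = 200, 3, 8
primes = primes_above(k, t)  # 5, 7, 11, 13, 17, 19, 23, 29
worst = 0
for x in range(N):
    for y in range(x + 1, N):
        shared = shared_primes(x, y, primes)
        # Distinct primes that each divide y - x must jointly divide it.
        assert (y - x) % prod(shared) == 0
        worst = max(worst, len(shared))
```

Here `worst` is the largest number of tables in which any two distinct items collide; since every prime exceeds `k`, the check `k ** worst < N` restates the bound in exponent form.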
3.1 Algorithms for strict update streams 

In this section, we use the CR-precis structure to design algorithms over strict 
update streams. 

Point Queries. Consider a CR-precis structure with height $k$ and width $t$. The
frequency of $x \in \mathcal{D}$ is estimated as $\hat{f}_x = \min_{j=1}^{t} T_j[x \bmod q_j]$.
The accuracy guarantees are given by Lemma 2.

Lemma 2. For $0 \le x \le N-1$, $0 \le \hat{f}_x - f_x \le \frac{(\log_k N - 1)}{t}(m - f_x)$.

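Proof. Clearly, $T_j[x \bmod q_j] \ge f_x$. Therefore, $\hat{f}_x \ge f_x$. Further,

$$t\hat{f}_x \le \sum_{j=1}^{t} T_j[x \bmod q_j] = t f_x + \sum_{j=1}^{t} \sum_{\substack{y \ne x \\ y \equiv x \bmod q_j}} f_y .$$

Thus,

$$t(\hat{f}_x - f_x) \le \sum_{j=1}^{t} \sum_{\substack{y \ne x \\ y \equiv x \bmod q_j}} f_y = \sum_{y \ne x} \sum_{j : y \equiv x \bmod q_j} f_y = \sum_{y \ne x} f_y \, |\{j : y \equiv x \bmod q_j\}| \le (\log_k N - 1)(m - f_x), \text{ by (1)}. \quad \Box$$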
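
A hedged sketch of the minimum estimator on a small strict update stream follows. The stream, the seed, and the parameters are arbitrary assumptions for illustration. The check uses the integer collision bound $\lceil \log_k N \rceil - 1$, a slightly conservative form of the factor in Lemma 2 adopted here so that the comparison is safe when $\log_k N$ is not an integer.

```python
import math
import random
from collections import Counter
from typing import List


def primes_above(k: int, t: int) -> List[int]:
    """Return the t smallest primes strictly greater than k (hypothetical helper)."""
    found: List[int] = []
    candidate = max(k + 1, 2)
    while len(found) < t:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def point_query(tables: List[List[int]], primes: List[int], x: int) -> int:
    """Minimum over the t tables of the counter that x maps to."""
    return min(table[x % q] for q, table in zip(primes, tables))


N, k, t = 1000, 5, 12
primes = primes_above(k, t)
tables = [[0] * q for q in primes]
freq: Counter = Counter()
rng = random.Random(7)

def apply(i: int, v: int) -> None:
    freq[i] += v
    for q, table in zip(primes, tables):
        table[i % q] += v

for _ in range(500):                 # insertions
    apply(rng.randrange(N), 1)
for i in list(freq)[:50]:            # deletions of previously inserted copies
    apply(i, -1)

m = sum(freq.values())
max_collisions = math.ceil(math.log(N, k)) - 1   # integer form of the bound (assumption)
violations = [
    x for x in range(N)
    if not (0 <= point_query(tables, primes, x) - freq[x]
            and t * (point_query(tables, primes, x) - freq[x])
                <= max_collisions * (m - freq[x]))
]
```

The comparison is done in integers (multiplying through by `t`) to avoid floating-point rounding in the bound.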

If we let $k = s$ and $t = s\log_s N$, then the space requirement of the point query
estimator is $O(s^2(\log_s N)^2(\log m))$ bits. The time required to obtain the estimate
is $O(t) = O(s\log_s N)$ arithmetic operations. A slightly improved guarantee that
is often useful for the point query estimator is given by Lemma 3. Here, $m^{res}(s)$
is the sum of all but the top-s frequencies [3, 7].

Lemma 3. Consider a CR-precis structure with height $s$ and width $2s\log_s N$.
Then, for any $0 \le x \le N-1$, $0 \le \hat{f}_x - f_x \le \frac{m^{res}(s)}{s}$.

Proof. Let $y_1, y_2, \ldots, y_s$ denote the items with the top-s frequencies in the stream
(with ties broken arbitrarily). By (1), $x$ conflicts with each $y_j \ne x$ in at most
$\log_s N$ buckets. Hence, the total number of buckets at which $x$ conflicts with any
of the top-s frequent items is at most $s\log_s N$. Thus there are at least $t - s\log_s N$
tables where $x$ does not conflict with any of the top-s frequencies. Applying the
proof of Lemma 2 to only this set of $t - s\log_s N \ge s\log_s N$ tables, the role of
$m$ is replaced by $m^{res}(s)$. This proves the lemma. □

As reviewed in Section 2 and Appendix B, the problems of estimating range-sums,
finding approximate frequent items, finding approximate hierarchical heavy
hitters and $\epsilon$-approximate quantiles essentially reduce to point query estimators
associated with simple hierarchical data structures (e.g., dyadic interval
hierarchy). Further, in the robust B-bucket histogram structure of [18], the role of
sketches can be replaced by the CR-precis structure. Theorem 4 states the space
versus accuracy guarantees for these problems over strict update streams. In
addition, the structure can be used to deterministically obtain approximate top-k
wavelet coefficients and Fourier transform coefficients over update streams; we
omit the details for brevity.

Theorem 4. 1. There exists a deterministic algorithm for finding e-approximate 
frequent items with parameter s using space 0(^2 (log « N)(log ~) log — (logm)) 
The time taken to process each stream update is 0(|(logs N)logN) arith- 
metic operations. 

2. There exists a deterministic algorithm for range-sum query estimator with 
parameters using space 0(s 2 (log 2 . N) (logs+loglog s N)) (log m) log ./V) bits. 
The time required for processing a stream update is 0(s(\og s N) (logiV)) 
arithmetic operations. 

3. For e < ({), e-approximate (p-quantiles may be deterministically computed using 

space 0(^-(log 5 iV) (logm) (log log + log ^) _1 ) bits. The time taken for pro- 
cessing a stream update is (9(i(log 2 iV) (log log N + log and for finding 
each quantile is 0(i(log 2 N)(\og \)(\og log TV + log 



4. There exists a deterministic algorithm for finding e-approximate hierarchi- 
cal heavy hitters using space 0(e~ 2 s 4 h 2 log(^p) log m) bits where, h is the 
height of the hierarchy. 

5. There exists a deterministic algorithm for constructing (1 — e)-optimal B- 
bucket histograms using space poly (B, -, logm, logiV). □ 

Estimating inner product and join sizes. Let $m_R = \sum_{i \in \mathcal{D}} f_i$ and let
$m_S = \sum_{i \in \mathcal{D}} g_i$. We maintain a CR-precis for each of the streams $R$ and $S$,
that have the same height $k$, same width $t$ and use the same prime numbers as the
table sizes. For $j = 1, 2, \ldots, t$, let $T_j$ and $U_j$ respectively denote the tables
maintained for streams $R$ and $S$ corresponding to the prime $q_j$. The estimate $\hat{P}$
for the inner product is calculated as $\hat{P} = \min_{j=1}^{t} \sum_{b=0}^{q_j - 1} T_j[b]U_j[b]$.

Lemma 5. $f \cdot g \le \hat{P} \le f \cdot g + \left(\frac{\log_k N - 1}{t}\right) m_R m_S$.

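Proof. For $j = 1, \ldots, t$, $\sum_{b=0}^{q_j - 1} T_j[b]U_j[b] \ge \sum_{b=0}^{q_j - 1} \sum_{x \equiv b \bmod q_j} f_x g_x = f \cdot g$.
Thus, $\hat{P} \ge f \cdot g$. Further,

$$t\hat{P} \le \sum_{j=1}^{t} \sum_{b=0}^{q_j - 1} T_j[b]U_j[b] = t(f \cdot g) + \sum_{j=1}^{t} \sum_{\substack{x \ne y \\ x \equiv y \bmod q_j}} f_x g_y$$

$$= t(f \cdot g) + \sum_{x, y : x \ne y} f_x g_y \sum_{j : x \equiv y \bmod q_j} 1 \le t(f \cdot g) + (\log_k N - 1)(m_R m_S - f \cdot g), \text{ by (1)}. \quad \Box$$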

Since $f \cdot g$ can be thought of as the size of the natural join of the streams $R$ and
$S$ (i.e., $|R \bowtie S|$), this shows that $|R \bowtie S|$ can be approximated up to additive
error of ^f^- using 0(s 2 (log^)(log s)(log(m))) bits. 
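
The inner-product estimator of Lemma 5 can be sketched as below for two small streams with non-negative frequencies. The frequency vectors, seed, and helper names are illustrative assumptions, and the collision factor is again taken in the conservative integer form $\lceil \log_k N \rceil - 1$.

```python
import math
import random
from typing import Dict, List


def primes_above(k: int, t: int) -> List[int]:
    """Return the t smallest primes strictly greater than k (hypothetical helper)."""
    found: List[int] = []
    candidate = max(k + 1, 2)
    while len(found) < t:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def build_tables(freq: Dict[int, int], primes: List[int]) -> List[List[int]]:
    """CR-precis tables after processing a stream whose final frequencies are freq."""
    tables = [[0] * q for q in primes]
    for item, count in freq.items():
        for q, table in zip(primes, tables):
            table[item % q] += count
    return tables


def inner_product_estimate(tables_r: List[List[int]], tables_s: List[List[int]]) -> int:
    """Minimum over tables j of sum_b T_j[b] * U_j[b]."""
    return min(sum(a * b for a, b in zip(tr, ts))
               for tr, ts in zip(tables_r, tables_s))


N, k, t = 1000, 5, 12
primes = primes_above(k, t)
rng = random.Random(11)
f = {rng.randrange(N): rng.randint(1, 9) for _ in range(40)}
g = {rng.randrange(N): rng.randint(1, 9) for _ in range(40)}

exact = sum(count * g.get(item, 0) for item, count in f.items())
estimate = inner_product_estimate(build_tables(f, primes), build_tables(g, primes))
m_r, m_s = sum(f.values()), sum(g.values())
max_collisions = math.ceil(math.log(N, k)) - 1   # integer form of the bound (assumption)
```

Both structures must share the same primes, since the estimate multiplies counters bucket by bucket.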

Estimating entropy. A deterministic algorithm that returns an a-approximation 
of H can be designed as follows. We maintain a CR-precis structure of height 
k > 2 and width t = 2m i ™ , where, a, e and e are parameters. We first use 
the point queries estimator to find all items x with f x > 22 with additive er- 

ror of < f x - U < Therefore, f x > (f x - E -f)(l- f = f' x (say). 

The estimated contribution to entropy by the frequent items is calculated as 
Hd = V„ f , . 2m — log-S-. Next, we remove the estimated contribution of the 
frequent items from the tables as follows. 

Tj[i modqj]:—Tj[i mod qj] — for each i s.t. /j > — and j = 1, . . . , t. 

H s estimates the contribution to H by the non- frequent items as follows. 
Hs = j E { T A b \ lo g ifffe] I 1 < & < 3j and T A b \ < ■ The estimate for H 
is returned as H = Hd + -ff s . The space versus accuracy guarantees of the 
algorithm are summarized in the following lemma. 



Lemma 6. For < e, e < j and a > 1, There exists a deterministic algorithm 
that returns an estimate H satisfying Htyl ~ e ^ < H < (1 + e)i/ using space 
0(^m lii ^ £i (log 4 m + log 4 7V)) &zts. □ 

The proof of Lemma basically uses equation and is omitted for brevity. 
For insert-only streams, an improvement can be obtained by using algorithm 
Freguent [28113125] instead of CR-precis to find the frequent items (and then 
reducing the CR-precis as before). This gives an a-approximation to H using 
space 0(-^m~sr (log 4 m + log 4 iV)) and matches the space complexity of the 
earlier randomized schemes of [S 23 , up to poly- logarithmic factors. 

Lower bound. A standard result [4] shows that any point query estimator with
error at most $\frac{m}{s}$ requires $\Omega(s \log \frac{N}{s})$ bits. For strict update
streams, we show stronger lower bounds in Lemmas 7 and 8.

Lemma 7. For s < -^p, any deterministic algorithm that satisfies — fi\ < ^ 
for any i eD over strict update streams requires f2(s(logm) log ^) space. 

Proof. Consider a stream consisting of s 2 distinct items, organized into s levels 
with s items per level. The frequency of an item at level I is set to ti = [—\ . 
Let mi denote the sum of the frequencies of the items in levels 1 through I. Let 
s' = 8s. We apply the algorithm A(s') to obtain the identities of the items, 
level by level. At iteration r, where, r = 1, . . . , s in succession, we maintain the 
invariant that items in levels higher than s — r + 1 have been discovered and their 
(exact) frequencies are deleted from the current stream. Let I = s — r + 1. By 
the invariant, at the beginning of iteration r, the frequencies are organized into 
levels 1 through I and m = mj. At iteration r, we return the set of items whose 
estimated frequencies according to A(s') is at least ti — 7 -^. Thus, all items at 
level I are returned. Further, it can be argued that the estimated frequencies of 

the other items do not cross ti as follows. We have m; = 53|'=i s ' L^~J < 2 i+1 . 
Therefore, ti — ti-% = ^— > ^r. At iteration r, the items at level s — r + 1 
are found and their frequencies are deducted. In this manner, after s iterations, 
the level by level arrangement of the items can be reconstructed. The number 
of such arrangements is ( N ), where, the s's are repeated s times in the 
multinomial coefficient. Thus, A(s') requires space log ( ) = f2(s 2 log ^-), 
since, N > 64s 2 . Since s' = 8s, we have that A(8s) requires J?(s 2 log^) bits. 
The space required is J?(s 2 log — ) = Q(s(logm) log — ). This proves the claim 
for deterministic algorithms. □ 

Lemma 8. For s < ^p-, any randomized algorithm that satisfies |/j — /j| < 
with probability at least | over strict update streams requires £2(s(\og m)(log ^)) 
bits. 

Proof. Consider the bit-vector indexing problem, where, the input is a bit vector 
v of size n that is presented in full, followed by an index i between 1 and n. The 



problem is to decide whether v[i] = 1 or not. This problem requires space f2(n) 
by any randomized algorithm that gives the correct answer with probability §. 
We can solve the bit-vector indexing problem with n — \_s 2 log using a point 
query estimator. 

For simplicity, let s divide N. A segment r of log N indices starting at index 1 
mod log y> that is, r = a log ^ + 1, ■ ■ ■ , log ~, is mapped to a pair (A T , l T ), 

where, A T G {a + 1, . . . , (a + l)Ns} and l T G {1,2,..., s}. The mapping A T is 
defined as follows. First we map the set S T = {j — a log ^ | j G r and = 1} 
to a number ^ r between and — . Clearly, there are 2 log t- = ^ possibilities for 
SV- A T is a log(iVs) bit number whose bit representation is a o i/ r , that is, the 
higher order 2 log s bits of A T are those of a and the lower order log — bits are 
those of v T . The level of r is l T and is the logs-bit number &2iogs<22iogs-2 ■ ■ ■ &2 
where a is the 2 logs bit number a — a2iogs&2iogs-i • ■ ■ o-i- Finally, the frequency 
of A T is set to f\ T — 2 1t . Since each l T is a logs-bit number, there are s levels. 
Since, l T — a2io gs ci2iogs-2 ■ ■ 'i2, the number of r's with the same value of l T 
is the number of possible combinations of the odd bit positions of a, that is, 
a2iogs-i, Q2iogs-3, • ■ ■ j 0,1. Since, there are logs such positions, the number of 
segments r with the same value of l T is exactly 2 logs = s. Moreover, from the 
construction, it follows that the mapping of segments r to pairs (A r , l T ) is 1-1, 
onto and efficiently constructible by storing only one segment at a time. 

If the error probability of the point estimator is at most 1 — giy, then, it 

follows using the argument of Lemma [7] that all j with v[j] = 1 are retrieved 

2 i 

with total error probability bounded by ^ = ^ . Given a point query estimator 
that satisfies |/j — fo\ < ^ with probability |, by returning the median of 
O(logs) independent estimators boosts the confidence to 1 — Hence, the 
space complexity is ^(^(logmXlog 

The above argument can be slightly improved as follows. For each i, there 
exists many permutations of the domain 1, . . . , s 2 log — such that the query index 
i is contained in the segment r that is mapped to the highest level s. In this 
configuration, if the point query estimator is invoked to obtain an estimate of 
f T , then, by the argument of LemmaEI fr is completely predicted and therefore, 
it can be correctly inferred as to whether v[i] is 1 or not. Hence, the space 
complexity is /2(s(logm)(log — ). □ 



3.2 General update streaming model 

In this section, we consider the general update streaming model. Lemma 9
summarizes the point query estimator for general update streams.

Lemma 9. Given a CR-precis structure with height $k$ and width $t$. For $x \in \mathcal{D}$,
let $\hat{f}_x = \frac{1}{t} \sum_{j=1}^{t} T_j[x \bmod q_j]$. Then,
$|\hat{f}_x - f_x| \le \frac{(\log_k N - 1)}{t}(L_1 - |f_x|)$.



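Proof. $t\hat{f}_x = \sum_{j=1}^{t} T_j[x \bmod q_j] = t f_x + \sum_{j=1}^{t} \sum \{f_y \mid y \ne x \text{ and } y \equiv x \bmod q_j\}$.
Thus,

$$t|\hat{f}_x - f_x| = \Big|\sum_{j=1}^{t} \sum_{\substack{y \ne x \\ y \equiv x \bmod q_j}} f_y\Big| = \Big|\sum_{y \ne x} \sum_{j : y \equiv x \bmod q_j} f_y\Big| \le \sum_{y \ne x} \sum_{j : y \equiv x \bmod q_j} |f_y| \le (\log_k N - 1)(L_1 - |f_x|), \text{ by (1)}. \quad \Box$$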
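
For general update streams the minimum is replaced by an average over the tables. The sketch below illustrates this on a random stream with positive and negative updates; the stream, seed, and helper names are assumptions, exact rational arithmetic is used to avoid rounding, and the collision factor is taken in the integer form $\lceil \log_k N \rceil - 1$.

```python
import math
import random
from fractions import Fraction
from typing import Dict, List


def primes_above(k: int, t: int) -> List[int]:
    """Return the t smallest primes strictly greater than k (hypothetical helper)."""
    found: List[int] = []
    candidate = max(k + 1, 2)
    while len(found) < t:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def average_estimate(tables: List[List[int]], primes: List[int], x: int) -> Fraction:
    """Average over the t tables of the counter that x maps to."""
    total = sum(table[x % q] for q, table in zip(primes, tables))
    return Fraction(total, len(primes))


N, k, t = 1000, 5, 12
primes = primes_above(k, t)
tables = [[0] * q for q in primes]
freq: Dict[int, int] = {}
rng = random.Random(3)
for _ in range(300):
    i, v = rng.randrange(N), rng.choice([-3, -2, -1, 1, 2, 3])
    freq[i] = freq.get(i, 0) + v            # frequencies may become negative
    for q, table in zip(primes, tables):
        table[i % q] += v

l1 = sum(abs(c) for c in freq.values())
max_collisions = math.ceil(math.log(N, k)) - 1   # integer form of the bound (assumption)
# Largest amount by which the error exceeds the bound; non-positive means the bound holds.
worst_gap = max(
    abs(average_estimate(tables, primes, x) - freq.get(x, 0))
    - Fraction(max_collisions, t) * (l1 - abs(freq.get(x, 0)))
    for x in range(N)
)
```

Unlike the strict case, the error here is two-sided, because colliding items with negative frequency can pull a counter below the true value.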

Similarly, we can obtain an estimator for the inner-product of streams $R$ and $S$.
Let $L_1(R)$ and $L_1(S)$ be the $L_1$ norms of streams $R$ and $S$ respectively.

Lemma 10. Consider a CR-precis structure of height $k$ and width $t$. Let
$\hat{P} = \frac{1}{t} \sum_{j=1}^{t} \sum_{b=0}^{q_j - 1} T_j[b]U_j[b]$. Then,
$|\hat{P} - f \cdot g| \le \frac{(\log_k N - 1)}{t} L_1(R) L_1(S)$. □

Lemma 11. Deterministic algorithms for the following problems in the general 
update streaming model requires f2(N) bits: (1) finding e-approximate frequent 
items with parameter s for any e < \, (2) finding e-approximate 4>-quantiles for 

any e < <p/2, (3) estimating the k th norm — (^2^—^\fa\ k ) l ^ k , for any real 
value ofk, to within any multiplicative approximation factor, and (4) estimating 
entropy to within any multiplicative approximation factor. 

Proof. Consider a family T of sets of size ^ elements each such that the inter- 
section between any two sets of the family does not exceed . It can be shown 2 
that there exist such families of size 2 n ^ N \ Corresponding to each set S in the 
family we construct a stream str(S) such that fa = 1 if % G S and fa — 0, other- 
wise. Denote by str\ o str-i the stream where the updates of stream str^ follow 
the updates of stream str\ in sequence. Let A be a deterministic frequent items 
algorithm. Suppose that after processing two distinct sets S and T from T ', the 
same memory pattern of ^4's store results. Let A be a stream of deletions that 
deletes all but | items from str(S). Since, Li(str(S) o A) = |, all remaining 
| items are found as frequent items. Further, Li(str(T) o A) > ^ — |, since, 
\S fl T\ < If s < y> t nen ' ~T > ■"•> and therefore, none of the items qualify 
as frequent. Since, str(S) and str{T) are mapped to the same bit pattern, so 
are str(S) o A and str(T) o A. Thus A makes an error in reporting frequent 
items in at least one of the two latter streams. Therefore, A must assign dis- 
tinct bit patterns to each str(S), for S G J-. Since, \T\ = 2 n< - N \ A requires 
J? (log (| JF|)) = f2(N) bits, proving part (1) of the lemma. 

Let S and T be sets from T such that str(S) and str(T) result in the same 
memory pattern of a quantile algorithm Q. Let A be a stream that deletes all 
items from S and then adds item with frequency fa — 1 to the stream. Now 
all quantiles of str(S) o A = 0. str(T) o A has at least distinct items, each 
with frequency 1. Thus, for every (j> < \ and e < i the kth <f> quantile of the 

2 Number of sets that are within a distance of -j from a given set of size y is 

y-f /AT/2N 2 < 2 ( N / 2 ) 2 Therefore \T\ > , > 2 " /2 - i (1^.) N/8 

Zvr=o U i - z liv/sJ • ln ereiore, ^ 2 /jv/2\^ - 2 (3e) N /s — 2 I3J 



two streams are different by at least k(j>N. Part (3) is proved by letting A be an 
update stream that deletes all elements from str(S). Then, Lk(str{S) o A) = 
and L k (str(T) o A) = Q{N 1 l k ). 

Proceeding as above, suppose A is an update stream that deletes all but one 
element from str(S). Then, H(str(S) o A) = 0. str(T) o A has f2(N) elements 
and therefore H(str(T) o A) = \ogN + 0(1). The multiplicative gap log TV : is 
arbitrarily large — this proves part (4) of the lemma. □ 

4 Conclusions 

We present the first deterministic sub-linear space algorithms for a number of
fundamental problems over update data streams, including point queries,
range-sum queries, finding approximate frequent items, finding approximate
quantiles, finding approximate hierarchical heavy hitters, estimating
inner-products, constructing near-optimal B-bucket histograms, estimating entropy
of data streams, etc. We also present new lower bound results for several
problems over update data streams.
A Proof of Lemma 1

Proof. Consider a CR-precis structure of height k > 6 and width t. Denote the 
n th prime by p n . By Rosser's theorem |50|. p n < n(\nn + logn), for n > 6. It 
follows that if a = ln fc _^ log fc , then, p a < a(\na + log a) < k. Letting c = 1 + In 2, 
we have, 

t a+t a+t 

/Jgj < Pn < X! CU ln 71 - CPa+t + C 
j—\ n—a n—a 



xlnx dx < cpa+t + c 



x lux 



Simplifying the RHS, we obtain the statement of the lemma. 



□ 



B Review 

In this Appendix, we present some more details of the basic techniques used in 
data stream processing with emphasis on processing update streams. 

Preliminaries. A dyadic interval at level $l$ is an interval of size $2^l$ from the
family of intervals $\{[i2^l, (i+1)2^l - 1],\ 0 \le i \le \frac{N}{2^l} - 1\}$, for
$0 \le l \le \log N$, assuming that $N$ is a power of 2. The set of dyadic intervals of
levels 0 through $\log N$ form a complete binary tree as follows. The root of the tree
is the single dyadic interval $[0, N-1]$. The nodes at distance $h$ from the root are
the set of dyadic intervals at level $\log N - h$. Moreover, for $0 < h \le \log N$, each
dyadic interval at level $h$ is of the form $I_h = [i2^h, (i+1)2^h - 1]$ and has two
children at level $h - 1$, namely, the left and the right halves of $I_h$. The left child
of $I_h$ is the interval $[2i \cdot 2^{h-1}, (2i+1)2^{h-1} - 1]$ and the right child is the
interval $[(2i+1)2^{h-1}, (2i+2)2^{h-1} - 1]$.
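
The decomposition of a range into dyadic intervals, used for range-sum queries, can be sketched with a greedy left-to-right scan. The function name `dyadic_cover` and the `(level, index)` encoding are assumptions made for this illustration; the pair `(l, i)` stands for the interval $[i2^l, (i+1)2^l - 1]$.

```python
from typing import List, Tuple


def dyadic_cover(left: int, right: int) -> List[Tuple[int, int]]:
    """Split [left, right] into disjoint dyadic intervals, given as (level, index) pairs."""
    cover: List[Tuple[int, int]] = []
    while left <= right:
        level = 0
        # Grow the interval while it stays aligned at `left` and fits inside the range.
        while left % (2 ** (level + 1)) == 0 and left + 2 ** (level + 1) - 1 <= right:
            level += 1
        cover.append((level, left >> level))
        left += 2 ** level
    return cover


def expand(cover: List[Tuple[int, int]]) -> List[int]:
    """List the domain elements covered, in order (for checking only)."""
    return [x for level, index in cover
            for x in range(index << level, (index + 1) << level)]


example = dyadic_cover(3, 12)   # range [3, 12] inside the domain {0, ..., 15}
```

For the range $[3, 12]$ the scan yields the intervals $[3,3]$, $[4,7]$, $[8,11]$ and $[12,12]$; summing point-query estimates for these intervals at their respective levels gives a range-sum estimate.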

Point query estimators can either make one-sided errors or two-sided errors. 
Point estimators with one-sided errors are either over-estimators, that is, fi < 
fi < fi + — (for e.g., Count-Min sketch 201): or, under-estimators, that is, 
fi — — < fi < fi (e.g., Counting Samples |17|. Lossy Counting |26|). Point 
estimators with two-sided errors return estimates satisfying |/; — fi\ < — (for 
e.g., Countsketch [7j). Algorithms for finding e-approximate frequent items 
with parameter s typically use point query estimators with parameter s' = -. 
For example, using a one-sided over-estimator, one can return all items i such 
that fi > — . Estimators with two sided errors can be used to return all items i 
such that fi > fi — —j . The problem of efficiently finding e-approximate frequent 
items can be solved by keeping a point query estimator corresponding to each 
dyadic level I = 0, . . . ^og^ HH- By construction, each item i belongs to a 
unique dyadic interval at level /, namely, the I th level ancestor of the interval 
[i,i] in the dyadic tree. The "items" at level I are the set of dyadic intervals 
{[j2 l , (j + l)2 l — l]} 0< j <2 d-i and are identifiable with the domain {0, 1, ... , 2 d ~ 1 }. 
With this interpretation, an arrival over the stream of the form (i, v) is processed 
as follows: update the item {{i % 2 l ), v) for each level / = 0, 1, . . . , [log —J . The 
frequency of a dyadic interval $I$ is defined as the sum of the individual frequencies
of items in $I$, and is denoted as $f_I$. Since each level-0 item belongs to one and
only one dyadic interval at a given level $l$, the sum of the interval frequencies at
level $l$ is the same as the sum of the item frequencies at level 0, which is $m$. If an
item $i$ is frequent (i.e., $f_i \ge \frac{m}{s}$), then the dyadic interval that contains $i$ at any
level $l$ has frequency at least $f_i$ and is therefore also frequent at level $l$. Hence,
at each level $l$ starting from $\lfloor \log \frac{N}{s} \rfloor$ and decrementing down to 1, it suffices to
consider only those dyadic intervals that are frequent at level $l$. The procedure
begins by enumerating $O(s)$ dyadic intervals at level $\lfloor \log \frac{N}{s} \rfloor$ and keeping as
candidates those intervals whose estimated frequency is at least $\frac{m}{s}$. In general, at level
$l$, there are $O(s)$ candidate intervals. For each candidate interval at level $l$, we
consider its left and right child intervals at level $l - 1$, and repeat the procedure.
Since, at any level, the number of candidate intervals is $O(s)$, the total number
of intervals considered in the iterations is $O(s \log \frac{N}{s})$. Using the Count-Min
sketch algorithm at each dyadic level with total space $O\left(\frac{s}{\epsilon} \log \frac{N}{s} \log \frac{s \log (N/s)}{\delta}\right)$
counters, one can return all frequent items with probability 1 and not return
any item with frequency less than $(1-\epsilon)\frac{m}{s}$ with probability $1 - \delta$. No sub-linear space
deterministic algorithms are known for update streams. 
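
The top-down search over dyadic levels described above can be sketched as follows. This is only an illustration under stated assumptions: the class `CountMin`, the function `frequent_items`, and the parameters `width`, `depth`, and `seed` are hypothetical names; the hash family is a simple seeded linear hash chosen for brevity; and all frequencies are assumed non-negative at query time, so that the sketch over-estimates. In the notation of the text, `width` would be chosen proportional to $s/\epsilon$.

```python
import random


class CountMin:
    """Count-Min sketch: a one-sided over-estimator for non-negative frequencies."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.p = (1 << 61) - 1             # prime modulus for the linear hashes
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        a, b = self.hashes[row]
        return ((a * key + b) % self.p) % self.width

    def update(self, key, v):
        for r in range(len(self.rows)):
            self.rows[r][self._bucket(r, key)] += v

    def estimate(self, key):
        return min(self.rows[r][self._bucket(r, key)]
                   for r in range(len(self.rows)))


def frequent_items(stream, log_n, s, width, depth):
    """Return candidate items whose estimated frequency is at least m/s."""
    n = 1 << log_n
    top = (n // s).bit_length() - 1 if n >= s else 0    # floor(log2(N/s))
    sketches = [CountMin(width, depth, seed=l) for l in range(top + 1)]
    m = 0
    for i, v in stream:
        m += v
        for l in range(top + 1):
            sketches[l].update(i >> l, v)               # item i maps to floor(i / 2^l)
    threshold = m / s
    # Enumerate every interval at the top level, then descend through children.
    candidates = [j for j in range(n >> top)
                  if sketches[top].estimate(j) >= threshold]
    for l in range(top, 0, -1):
        candidates = [c for j in candidates for c in (2 * j, 2 * j + 1)
                      if sketches[l - 1].estimate(c) >= threshold]
    return sorted(candidates)
```

Because each estimate is never below the true frequency, every item with $f_i \ge m/s$ survives at every level; false positives depend on hash collisions and hence on `width` and `depth`.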

The hierarchical heavy hitters problem [9,14] is a useful generalization of the
frequent items problem for domains that have a natural hierarchy (e.g., the domain
of IP addresses). Given a hierarchy, the frequency of a node $X$ is defined as the
sum of the frequencies of the leaf nodes (i.e., items) in the sub-tree rooted at
$X$. The definition of a hierarchical heavy hitter (HHH) node is inductive: a leaf
node $x$ is an HHH node provided $f_x \ge \frac{m}{s}$. An internal node is an HHH node
provided that its frequency, after discounting the frequency of all its descendant
HHH nodes, is at least $\frac{m}{s}$. The problem is (a) to find all nodes that are HHH
nodes, and (b) to not output any node whose frequency, after discounting the
frequencies of descendant HHH nodes, is below $(1-\epsilon)\frac{m}{s}$. This problem has been
studied in [9,10,14,24]. As shown in (5], it can be solved by using a simple
bottom-up traversal of the hierarchy, identifying the frequent items at each level, and
then subtracting the estimates of the frequent items at a level from the estimated
frequency of its parent [9]. Using Count-Min sketch, the space complexity is
O(£(log £i^)(log N)(log m)) bits. E3 presents an $\Omega(s^2)$ space lower bound
for this problem. Deterministic algorithms for finding HHH items over update 
streams are not known. 
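
The bottom-up traversal can be sketched as follows, under simplifying assumptions: the hierarchy is taken to be the binary dyadic tree over $\{0, \ldots, 2^{\log N} - 1\}$, exact counts stand in for the point query estimates, and "discounting" is read as removing the mass already attributed to an HHH descendant. The function name and its arguments are hypothetical.

```python
def hierarchical_heavy_hitters(freq, log_n, s):
    """freq maps leaf items to non-negative frequencies.

    Returns the HHH nodes as (level, index) pairs, found bottom-up.
    """
    m = sum(freq.values())
    threshold = m / s
    residual = dict(freq)                  # discounted frequency of each node
    hhh = []
    for level in range(log_n + 1):
        parent_residual = {}
        for node, count in residual.items():
            if count >= threshold:
                hhh.append((level, node))
                count = 0                  # discount this node's mass at its ancestors
            if level < log_n:
                parent = node >> 1
                parent_residual[parent] = parent_residual.get(parent, 0) + count
        residual = parent_residual
    return hhh
```

With `freq = {0: 30, 1: 8, 2: 7, 3: 5, 8: 10, 9: 10, 12: 15, 13: 15}`, `log_n = 4`, and `s = 4`, the threshold is 25: leaf 0 is an HHH node, the level-1 interval [12, 13] is an HHH node with mass 30, and the root is an HHH node because 40 units remain after discounting the other two.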

The range-sum query problem, that is, estimating the frequency of a given
range, can be solved by using the technique of dyadic intervals [20]. Any range
can be uniquely decomposed into the disjoint union of at most $2 \log N$ dyadic
intervals of maximum size (for example, over the domain $\{0, \ldots, 15\}$, the
interval $[3,12] = [3,3] + [4,7] + [8,11] + [12,12]$). The technique is to keep a point
query estimator corresponding to each dyadic level $l = 0, 1, \ldots, \log N - 1$. The
range-sum query is estimated as the sum of the estimates of the frequencies
of each of the constituent maximal dyadic intervals of the given range. Using
Count-Min sketch at each level, this can be accomplished using space
$O(s \log \log N \log m)$ bits with probability $1 - \delta$ [10]. The problem of finding
$\epsilon$-approximate $\phi$-quantiles can be reduced to range-sum queries as follows.
For each $k = 1, 2, \ldots, \phi^{-1}$, a binary search is performed over the domain to
find an item $a_k$ such that the range sum $f_{[a_k, N-1]}$ lies between $(k\phi - \epsilon)m$ and
$(k\phi + \epsilon)m$. A technique for constructing and maintaining $(1 - \epsilon)$-optimal $B$-bucket
histograms over strict update streams is presented in [18] using space and time
$\mathrm{poly}(B, \frac{1}{\epsilon}, \log m, \log N)$ and improved in [22].
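
The dyadic decomposition, the range-sum estimate, and the binary search for a quantile can be sketched together. This is a minimal illustration in which exact per-level counters stand in for the point query estimators; all function names are hypothetical, and one extra level ($l = \log N$) is kept so that the whole domain is a single interval.

```python
from collections import Counter


def dyadic_decomposition(lo, hi):
    """Split [lo, hi] into maximal dyadic intervals, as (level, index) pairs."""
    parts = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while lo % (1 << (level + 1)) == 0 and lo + (1 << (level + 1)) - 1 <= hi:
            level += 1
        parts.append((level, lo >> level))
        lo += 1 << level
    return parts


def build_levels(stream, log_n):
    """One counter per dyadic level; exact counts replace the sketches here."""
    levels = [Counter() for _ in range(log_n + 1)]
    for i, v in stream:
        for l in range(log_n + 1):
            levels[l][i >> l] += v
    return levels


def range_sum(levels, lo, hi):
    """Sum the per-interval estimates over the decomposition of [lo, hi]."""
    return sum(levels[l][j] for l, j in dyadic_decomposition(lo, hi))


def quantile(levels, log_n, m, fraction):
    """Largest item a whose suffix sum f[a, N-1] is at least fraction * m."""
    n = 1 << log_n
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if range_sum(levels, mid, n - 1) >= fraction * m:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The binary search relies on the suffix sum being non-increasing in its left end point, which holds when all frequencies are non-negative.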

For estimating the inner-product $f \cdot g$, pQ presents the product of sketches
technique using space $O(s \log \frac{1}{\delta})$ counters with additive error of
$O\left(\frac{1}{\sqrt{s}}(F_2(R)F_2(S))^{1/2}\right)$,
where $F_2(R) = \sum_{i \in \mathcal{D}} f_i^2$ and $F_2(S) = \sum_{i \in \mathcal{D}} g_i^2$. DO also presents a space lower
bound of $s = \Omega(m^2/(f \cdot g))$ for estimating $f \cdot g$. Count-Min sketches [10] can
be used to return an estimate that has additive error of $m^2/s$ with probability
$1 - \delta$ using space $O(s \log \frac{1}{\delta})$. The product of sketches algorithm is improved
to match the space lower bound.
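
The Count-Min route to inner-product estimation can be sketched as follows. This is an illustration under assumptions: both vectors are non-negative, both sketches share the same hash functions, and the helper names, the seeded linear hash family, and the parameters `width` and `depth` are hypothetical choices rather than the construction of the cited work.

```python
import random

P = (1 << 61) - 1                          # prime modulus for the linear hashes


def make_hashes(depth, seed=0):
    """Draw one (a, b) pair per row; both sketches must share these pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(depth)]


def sketch(freq, hashes, width):
    """Build Count-Min rows for a frequency vector given as {item: count}."""
    rows = [[0] * width for _ in hashes]
    for item, count in freq.items():
        for r, (a, b) in enumerate(hashes):
            rows[r][((a * item + b) % P) % width] += count
    return rows


def inner_product_estimate(rows_f, rows_g):
    """Minimum over rows of the bucket-wise product of the two sketches."""
    return min(sum(x * y for x, y in zip(rf, rg))
               for rf, rg in zip(rows_f, rows_g))
```

For non-negative vectors each row's bucket-wise product is at least $f \cdot g$, since colliding items only add non-negative cross terms, so the minimum over rows never under-estimates; it also never exceeds the product of the two stream lengths.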



